[Chapter Fifteen][Previous]
[Next] [Art of
Assembly][Randall Hyde]
Art of Assembly: Chapter Fifteen
- 15.3 - Character String Functions
- 15.3.1 - Substr
- 15.3.2 - Index
- 15.3.3 - Repeat
- 15.3.4 - Insert
- 15.3.5 - Delete
- 15.3.6 - Concatenation
15.3 Character String Functions
Most high level languages, like Pascal, BASIC, "C", and PL/I,
provide several string functions and procedures (either built into the language
or as part of a standard library). Other than the five string operations
provided above, the 80x86 doesn't support any string functions. Therefore,
if you need a particular string function, you'll have to write it yourself.
The following sections describe many of the more popular string functions
and how to implement them in assembly language.
15.3.1 Substr
The Substr
(substring) function copies a portion of one
string to another. In a high level language, this function usually takes
the form:
DestStr := Substr(SrcStr,Index,Length);
where:
DestStr
is the name of the string variable where you want
to store the substring,
SrcStr
is the name of the source string (from which the
substring is to be taken),
Index
is the starting character position within the string
(1..length
(SrcStr
)), and
- Length is the length of the substring you want to copy into
DestStr
.
-
The following examples show how Substr
works.
SrcStr := 'This is an example of a string';
DestStr := Substr(SrcStr,11,7);
write(DestStr);
This prints 'example'. The index value is eleven, so, the Substr
function will begin copying data starting at the eleventh character in the
string. The eleventh character is the 'e' in 'example'. The length of the
string is seven.
This invocation copies the seven characters 'example' to DestStr
.
SrcStr := 'This is an example of a string';
DestStr := Substr(SrcStr,1,10);
write(DestStr);
This prints 'This is an'. Since the index is one, this occurrence of the
Substr
function starts copying 10 characters starting with
the first character in the string.
SrcStr := 'This is an example of a string';
DestStr := Substr(SrcStr,20,11);
write(DestStr);
This prints 'of a string'. This call to Substr
extracts the
last eleven characters in the string.
What happens if the index and length values are out of bounds? For example,
what happens if Index
is zero or is greater than the length
of the string? What happens if Index
is fine, but the sum of
Index
and Length
is greater than the length of
the source string? You can handle these abnormal situations in one of three
ways: (1)ignore the possibility of error; (2)abort the program with a run-time
error; (3)process some reasonable number of characters in response to the
request.
The first solution operates under the assumption that the caller never makes
a mistake computing the values for the parameters to the Substr
function. It blindly assumes that the values passed to the Substr
function are correct and processes the string based on that assumption.
This can produce some bizarre effects. Consider the following examples,
which use length-prefixed strings:
SourceStr :='1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ';
DestStr := Substr(SourceStr,0,5);
Write('DestStr');
prints '$1234'. The reason, of course, is that SourceStr
is
a length-prefixed string. Therefore the length, 36, appears at offset zero
within the string. If Substr
uses the illegal index of zero
then the length of the string will be returned as the first character. In
this particular case, the length of the string, 36, just happened to correspond
to the ASCII code for the '$' character.
The situation is considerably worse if the value specified for Index
is negative or is greater than the length of the string. In such a case,
the Substr
function would be returning a substring containing
characters appearing before or after the source string. This is not a reasonable
result.
Despite the problems with ignoring the possibility of error in the Substr
function, there is one big advantage to processing substrings in this manner:
the resulting Substr
code is more efficient if it doesn't have
to perform any run-time checking on the data. If you know that the index
and length values are always within an acceptable range, then there is no
need to do this checking within Substr
function. If you can
guarantee that an error will not occur, your programs will run (somewhat)
faster by eliminating the run-time check.
Since most programs are rarely error-free, you're taking a big gamble if
you assume that all calls to the Substr
routine are passing
reasonable values. Therefore, some sort of run-time check is often necessary
to catch errors in your program. An error occurs under the following conditions:
- The index parameter (
Index
) is less than one.
Index
is greater than the length of the string.
- The
Substr
length parameter (Length
) is greater
than the length of the string.
- The sum of
Index
and Length
is greater than
the length of the string.
-
An alternative to ignoring any of these errors is to abort with an error
message. This is probably fine during the program development phase, but
once your program is in the hands of users it could be a real disaster.
Your customers wouldn't be very happy if they'd spent all day entering data
into a program and it aborted, causing them to lose the data they've entered.
An alternative to aborting when an error occurs is to have the Substr
function return an error condition. Then leave it up to the calling code
to determine if an error has occurred. This technique works well with the
third alternative to handling errors: processing the substring as best you
can.
The third alternative, handling the error as best you can, is probably the
best alternative. Handle the error conditions in the following manner:
- The index parameter (
Index
) is less than one. There are
two ways to handle this error condition. One way is to automatically set
the Index
parameter to one and return the substring beginning
with the first character of the source string. The other alternative is
to return the empty string, a string of length zero, as the substring. Variations
on this theme are also possible. You might return the substring beginning
with the first character if the index is zero and an empty string if the
index is negative. Another alternative is to use unsigned numbers. Then
you've only got to worry about the case where Index
is zero.
A negative number, should the calling code accidentally generate one, would
look like a large positive number.
- The index is greater than the length of the string. If this is the case,
then the
Substr
function should return an empty string. Intuitively,
this is the proper response in this situation.
- The
Substr
length parameter (Length
) is greater
than the length of the string. -or-
- The sum of
Index
and Length
is greater than
the length of the string. Points three and four are the same problem, the
length of the desired substring extends beyond the end of the source string.
In this event, Substr
should return the substring consisting
of those characters starting at Index
through the end of the
source string.
-
The following code for the Substr
function expects four parameters:
the addresses of the source and destination strings, the starting index,
and the length of the desired substring. Substr
expects the
parameters in the following registers:
ds:si- The address of the source string.
es:di- The address of the destination string.
ch- The starting index.
cl- The length of the substring.
Substr returns the following values:
- The substring, at location
es:di
.
Substr
clears the carry flag if there were no errors. Substr
sets the carry flag if there was an error.
Substr
preserves all the registers.
-
If an error occurs, then the calling code must examine the values in si,
di
and cx
to determine the exact cause of the error
(if this is necessary). In the event of an error, the Substr
function returns the following substrings:
- If the
Index
parameter (ch
) is zero, Substr
uses one instead.
- The
Index
and Length
parameters are both unsigned
byte values, therefore they are never negative.
- If the
Index
parameter is greater than the length of the
source string, Substr
returns an empty string.
- If the sum of the
Index
and Length
parameters
is greater than the length of the source string, Substr
returns
only those characters from Index
through the end of the source
string. The following code realizes the substring function.
-
; Substring function.
; Substring function.
;
; HLL form:
;
;procedure substring(var Src:string;
; Index, Length:integer;
; var Dest:string);
;
; Src- Address of a source string.
; Index- Index into the source string.
; Length- Length of the substring to extract.
; Dest- Address of a destination string.
;
; Copies the source string from address [Src+index] of length
; Length to the destination string.
;
; If an error occurs, the carry flag is returned set, otherwise
; clear.
;
; Parameters are passed as follows:
;
; DS:SI- Source string address.
; ES:DI- Destination string address.
; CH- Index into source string.
; CL- Length of source string.
;
; Note: the strings pointed at by the SI and DI registers are
; length-prefixed strings. That is, the first byte of each
; string contains the length of that string.
Substring proc near
push ax
push cx
push di
push si
clc ;Assume no error.
pushf ;Save direction flag status.
; Check the validity of the parameters.
cmp ch, [si] ;Is index beyond the length of
ja ReturnEmpty ; the source string?
mov al, ch ;See if the sum of index and
dec al ; length is beyond the end of the
add al, cl ; string.
jc TooLong ;Error if > 255.
cmp al, [si] ;Beyond the length of the source?
jbe OkaySoFar
; If the substring isn't completely contained within the source
; string, truncate it:
TooLong: popf
stc ;Return an error flag.
pushf
mov al, [si] ;Get maximum length.
sub al, ch ;Subtract index value.
inc al ;Adjust as appropriate.
mov cl, al ;Save as new length.
OkaySoFar: mov es:[di], cl ;Save destination string length.
inc di
mov al, ch ;Get index into source.
mov ch, 0 ;Zero extend length value into CX.
mov ah, 0 ;Zero extend index into AX.
add si, ax ;Compute address of substring.
cld
rep movsb ;Copy the substring.
popf
SubStrDone: pop si
pop di
pop cx
pop ax
ret
; Return an empty string here:
ReturnEmpty: mov byte ptr es:[di], 0
popf
stc
jmp SubStrDone
SubString endp
15.3.2 Index
The Index
string function searches for the first occurrence
of one string within another and returns the offset to that occurrence.
Consider the following HLL form:
SourceStr := 'Hello world';
TestStr := 'world';
I := INDEX(SourceStr, TestStr);
The Index
function scans through the source string looking
for the first occurrence of the test string. If found, it returns the index
into the source string where the test string begins. In the example above,
the Index
function would return seven since the substring 'world'
starts at the seventh character position in the source string.
The only possible error occurs if Index
cannot find the test
string in the source string. In such a situation, most implementations return
zero. Our version will do likewise. The Index
function which
follows operates in the following fashion:
1) It compares the length of the test string to the length of the source
string. If the test string is longer, Index
immediately returns
zero since there is no way the test string will be found in the source string
in this situation.
2) The index
function operates as follows:
i := 1;
while (i < (length(source)-length(test)) and
test <> substr(source, i, length(test)) do
i := i+1;
When this loop terminates, if (i < length(source)-length(test)) then
it contains the index into source where test begins. Otherwise test is not
a substring of source. Using the previous example, this loop compares test
to source in the following manner:
i=1
test: world No match
source: Hello world
i=2
test: world No match
source: Hello world
i=3
test: world No match
source: Hello world
i=4
test: world No match
source: Hello world
i=5
test: world No match
source: Hello world
i=6
test: world No match
source: Hello world
i=7
test: world Match
source: Hello world
There are (algorithmically) better ways to do this comparison[7],
however, the algorithm above lends itself to the use of 80x86 string instructions
and is very easy to understand. Index
's code follows:
; INDEX- computes the offset of one string within another.
;
; On entry:
;
; ES:DI- Points at the test string that INDEX will search for
; in the source string.
; DS:SI- Points at the source string which (presumably)
; contains the string INDEX is searching for.
;
; On exit:
;
; AX- Contains the offset into the source string where the
; test string was found.
INDEX proc near
push si
push di
push bx
push cx
pushf ;Save direction flag value.
cld
mov al, es:[di] ;Get the length of the test string.
cmp al, [si] ;See if it is longer than the length
ja NotThere ; of the source string.
; Compute the index of the last character we need to compare the
; test string against in the source string.
mov al, es:[di] ;Length of test string.
mov cl, al ;Save for later.
mov ch, 0
sub al, [si] ;Length of source string.
mov bl, al ;# of times to repeat loop.
inc di ;Skip over length byte.
xor ax, ax ;Init index to zero.
CmpLoop: inc ax ;Bump index by one.
inc si ;Move on to the next char in source.
push si ;Save string pointers and the
push di ; length of the test string.
push cx
rep cmpsb ;Compare the strings.
pop cx ;Restore string pointers
pop di ; and length.
pop si
je Foundindex ;If we found the substring.
dec bl
jnz CmpLoop ;Try next entry in source string.
; If we fall down here, the test string doesn't appear inside the
; source string.
NotThere: xor ax, ax ;Return INDEX = 0
; If the substring was found in the loop above, remove the
; garbage left on the stack
FoundIndex: popf
pop cx
pop bx
pop di
pop si
ret
INDEX endp
15.3.3 Repeat
The Repeat
string function expects three parameters- the
address of a string, a length, and a character. It constructs a string of
the specified length containing "length" copies of the specified
character. For example, Repeat(STR,5,'*')
stores the string
'*****' into the STR
string variable. This is a very easy string
function to write, thanks to the stosb
instruction:
; REPEAT- Constructs a string of length CX where each element
; is initialized to the character passed in AL.
;
; On entry:
;
; ES:DI- Points at the string to be constructed.
; CX- Contains the length of the string.
; AL- Contains the character with which each element of
; the string is to be initialized.
REPEAT proc near
push di
push ax
push cx
pushf ;Save direction flag value.
cld
mov es:[di], cl ;Save string length.
mov ch, 0 ;Just in case.
inc di ;Start string at next location.
rep stosb
popf
pop cx
pop ax
pop di
ret
REPEAT endp
15.3.4 Insert
The Insert
string function inserts one string into another.
It expects three parameters, a source string, a destination string, and
an index. Insert
inserts the source string into the destination
string starting at the offset specified by the index parameter. HLLs usually
call the Insert
procedure as follows:
source := ' there';
dest := 'Hello world';
INSERT(source,dest,6);
The call to Insert
above would change source to contain the
string 'Hello there world'. It does this by inserting the string ' there'
before the sixth character in 'Hello world'.
The insert procedure using the following algorithm:
Insert(Src,dest,index);
1) Move the characters from location dest+index
through the
end of the destination string length
(Src
) bytes
up in memory.
2) Copy the characters from the Src
string to location
dest+index
.
3) Adjust the length of the destination string so that it is the sum of
the destination and source lengths. The following code implements this algorithm:
; INSERT- Inserts one string into another.
;
; On entry:
;
; DS:SI Points at the source string to be inserted
;
; ES:DI Points at the destination string into which the source
; string will be inserted.
;
; DX Contains the offset into the destination string where the
; source string is to be inserted.
;
;
; All registers are preserved.
;
; Error condition-
;
; If the length of the newly created string is greater than 255,
; the insert operation will not be performed and the carry flag
; will be returned set.
;
; If the index is greater than the length of the destination
; string,
; then the source string will be appended to the end of the destin- ; ation string.
INSERT proc near
push si
push di
push dx
push cx
push bx
push ax
clc ;Assume no error.
pushf
mov dh, 0 ;Just to be safe.
; First, see if the new string will be too long.
mov ch, 0
mov ah, ch
mov bh, ch
mov al, es:[di] ;AX = length of dest string.
mov cl, [si] ;CX = length of source string.
mov bl, al ;BX = length of new string.
add bl, cl
jc TooLong ;Abort if too long.
mov es:[di], bl ;Update length.
; See if the index value is too large:
cmp dl, al
jbe IndexIsOK
mov dl, al
IndexIsOK:
; Now, make room for the string that's about to be inserted.
push si ;Save for later.
push cx
mov si, di ;Point SI at the end of current
add si, ax ; destination string.
add di, bx ;Point DI at the end of new str.
std
rep movsb ;Open up space for new string.
; Now, copy the source string into the space opened up.
pop cx
pop si
add si, cx ;Point at end of source string.
rep movsb
jmp INSERTDone
TooLong: popf
stc
pushf
INSERTDone: popf
pop ax
pop bx
pop cx
pop dx
pop di
pop si
ret
INSERT endp
15.3.5 Delete
The Delete
string removes characters from a string. It
expects three parameters - the address of a string, an index into that string,
and the number of characters to remove from that string. A HLL call to Delete
usually takes the form:
Delete(Str,index,length);
For example,
Str := 'Hello there world';
Delete(str,7,6);
This call to Delete
will leave str containing 'Hello world'.
The algorithm for the delete operation is the following:
1) Subtract the length parameter value from the length of the destination
string and update the length of the destination string with this new value.
2) Copy any characters following the deleted substring over the top of the
deleted substring.
There are a couple of errors that may occur when using the delete procedure.
The index value could be zero or larger than the size of the specified string.
In this case, the Delete
procedure shouldn't do anything to
the string. If the sum of the index and length parameters is greater than
the length of the string, then the Delete
procedure should
delete all the characters to the end of the string. The following code implements
the Delete
procedure:
; DELETE - removes some substring from a string.
;
; On entry:
;
; DS:SI Points at the source string.
; DX Index into the string of the start of the substring
; to delete.
; CX Length of the substring to be deleted.
;
; Error conditions-
;
; If DX is greater than the length of the string, then the
; operation is aborted.
;
; If DX+CX is greater than the length of the string, DELETE only
; deletes those characters from DX through the end of the string.
DELETE proc near
push es
push si
push di
push ax
push cx
push dx
pushf ;Save direction flag.
mov ax, ds ;Source and destination strings
mov es, ax ; are the same.
mov ah, 0
mov dh, ah ;Just to be safe.
mov ch, ah
; See if any error conditions exist.
mov al, [si] ;Get the string length
cmp dl, al ;Is the index too big?
ja TooBig
mov al, dl ;Now see if INDEX+LENGTH
add al, cl ;is too large
jc Truncate
cmp al, [si]
jbe LengthIsOK
; If the substring is too big, truncate it to fit.
Truncate: mov cl, [si] ;Compute maximum length
sub cl, dl
inc cl
; Compute the length of the new string.
LengthIsOK: mov al, [si]
sub al, cl
mov [si], al
; Okay, now delete the specified substring.
add si, dx ;Compute address of the substring
mov di, si ; to be deleted, and the address of
add di, cx ; the first character following it.
cld
rep movsb ;Delete the string.
TooBig: popf
pop dx
pop cx
pop ax
pop di
pop si
pop es
ret
DELETE endp
15.3.6 Concatenation
The concatenation operation takes two strings and appends one to the
end of the other. For example, Concat('Hello ','world')
produces
the string 'Hello world'. Some high level languages treat concatenation
as a function call, others as a procedure call. Since in assembly language
everything is a procedure call anyway, we'll adopt the procedural syntax.
Our Concat
procedure will take the following form:
Concat(source1,source2,dest);
This procedure will copy source1
to dest
, then
it will concatenate source2
to the end of dest
.
Concat
follows:
; Concat- Copies the string pointed at by SI to the string
; rointed at by DI and then concatenates the string;
; pointed at by BX to the destination string.
;
; On entry-
;
; DS:SI- Points at the first source string
; DS:BX- Points at the second source string
; ES:DI- Points at the destination string.
;
; Error condition-
;
; The sum of the lengths of the two strings is greater than 255.
; In this event, the second string will be truncated so that the
; entire string is less than 256 characters in length.
CONCAT proc near
push si
push di
push cx
push ax
pushf
; Copy the first string to the destination string:
mov al, [si]
mov cl, al
mov ch, 0
mov ah, ch
add al, [bx] ;Compute the sum of the string's
adc ah, 0 ; lengths.
cmp ax, 256
jb SetNewLength
mov ah, [si] ;Save original string length.
mov al, 255 ;Fix string length at 255.
SetNewLength: mov es:[di], al ;Save new string length.
inc di ;Skip over length bytes.
inc si
rep movsb ;Copy source1 to dest string.
; If the sum of the two strings is too long, the second string
; must be truncated.
mov cl, [bx] ;Get length of second string.
cmp ax, 256
jb LengthsAreOK
mov cl, ah ;Compute truncated length.
neg cl ;CL := 256-Length(Str1).
LengthsAreOK: lea si, 1[bx] ;Point at second string and
; ; skip the string length.
cld
rep movsb ;Perform the concatenation.
popf
pop ax
pop cx
pop di
pop si
ret
CONCAT endp
[7] The interested reader should look up the
Knuth-Morris-Pratt algorithm in "Data Structure Techniques" by
Thomas A. Standish. The Boyer-Moore algorithm is another fast string search
routine, although somewhat more complex.
- 15.3 - Character String Functions
- 15.3.1 - Substr
- 15.3.2 - Index
- 15.3.3 - Repeat
- 15.3.4 - Insert
- 15.3.5 - Delete
- 15.3.6 - Concatenation
Art of Assembly: Chapter Fifteen - 28 SEP 1996
[Chapter Fifteen][Previous]
[Next] [Art of
Assembly][Randall Hyde]